LLM vs. Human Ratings

Prerequisite: `quarto render numerical_ratings.qmd` must have created
`results/metrics_long.csv`.
Add your hand-coded spreadsheet of human scores as `results/human_ratings.csv` with columns:

`paper`, `metric`, `midpoint_human`
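The merge this prerequisite sets up can be sketched as follows. This is a minimal illustration, not the report's actual code: the column name `midpoint_llm` and the inline example rows are assumptions; in practice both frames would come from `pd.read_csv` on the files named above.

```python
import pandas as pd

# Stand-ins for pd.read_csv("results/metrics_long.csv") and
# pd.read_csv("results/human_ratings.csv"); rows are illustrative only.
llm = pd.DataFrame({
    "paper": ["Doe 2024", "Doe 2024"],
    "metric": ["overall", "methods"],
    "midpoint_llm": [70.0, 65.0],   # assumed LLM score column name
})
human = pd.DataFrame({
    "paper": ["Doe 2024"],
    "metric": ["overall"],
    "midpoint_human": [60.0],
})

# Inner join on the shared keys keeps only paper-metric pairs rated by both.
merged = llm.merge(human, on=["paper", "metric"], how="inner")
```

An inner join is what produces the "merged rows" count reported below: pairs present on only one side drop out and surface instead in the mismatch warnings.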

⚠️  Human titles without LLM id (add rows to UJ_map.csv):
   • A Welfare Analysis of Policies Impacting Climate Change
   • Biodiversity risk
   • Building Resilient Education Systems: Evidence from Large-Scale Randomized Trials in Five Countries
   • Cognitive Behavioral Therapy among Ghana’s Rural Poor Is Effective Regardless of Baseline Mental Distress
   • Does Conservation Work in General Equilibrium?
   • Forecasting Existential Risks: Evidence From a Long-Run Forecasting Tournament
   • How Effective Is (More) Money? Randomizing Unconditional Cash Transfer Amounts in the US
   • How Much Would Reducing Lead Exposure Improve Children’s Learning in the Developing World?
   • Intergenerational Child Mortality Impacts of Deworming: Experimental Evidence from Two Decades of the Kenya Life Panel Survey
   • Long term cost-effectiveness of resilient foods for global catastrophes compared to artificial general intelligence safety
   • The Benefits and Costs of Guest Worker Programs: Experimental Evidence From the India-UAE Migration Corridor
   • The global potential for natural regeneration in deforested tropical regions
   • The Long-Run Effects of Psychotherapy on Depression, Beliefs, and Economic Outcomes
   • The Macroeconomic Impact of Climate Change: Global vs. Local Temperature 
   • Urban Forests: Environmental Health Values and Risks
   • When do “Nudges” Increase Welfare?
   • Zero-Sum Thinking, the Evolution of Effort-Suppressing Beliefs, and Economic Development

⚠️  LLM papers lacking a human match:
   • Acemoglu et al. 2024
   • Alcott et al. 2024
   • Arora et al. 2023
   • Barker et al. 2021
   • Bhat et al. 2022
   • Crawfurd et al. 2023
   • Walker et al. 2023
✅ merged rows: 305  (24 papers × 7 metrics)
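The two warning lists above are plain set differences between the human titles and the LLM paper ids. A minimal sketch (the set contents here are placeholders, not the real titles):

```python
# Titles hand-coded by humans vs. ids present in the LLM output;
# placeholder values, standing in for the real title/id columns.
human_titles = {"A", "B", "C"}
llm_ids = {"B", "C", "D"}

# Human titles with no LLM id: fix by adding rows to UJ_map.csv.
human_without_llm = sorted(human_titles - llm_ids)

# LLM papers with no hand-coded human scores yet.
llm_without_human = sorted(llm_ids - human_titles)
```

Sorting the differences keeps the warning output stable between runs, which matches the alphabetized lists above.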

Overall correlation

Table 3.1: Sample size and mean absolute difference (MAD) by metric.

metric                 N   MAD
advancing_knowledge   23  10.8
claims_evidence        5  14.7
global_relevance      24  14.8
logic_communication   24  13.1
methods               23  14.2
open_science          24  20.7
overall               24  11.3
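Table 3.1 can be reproduced with a groupby over the merged frame. A sketch with placeholder rows (the real input is the merged LLM/human data; `abs_diff` is a helper column introduced here):

```python
import pandas as pd

# Placeholder merged rows; one row per paper-metric pair rated by both sides.
df = pd.DataFrame({
    "metric": ["overall", "overall", "methods"],
    "midpoint_llm": [70.0, 50.0, 65.0],
    "midpoint_human": [60.0, 55.0, 80.0],
})

# MAD = mean absolute difference between LLM and human mid-points.
df["abs_diff"] = (df["midpoint_llm"] - df["midpoint_human"]).abs()
table = df.groupby("metric")["abs_diff"].agg(N="size", MAD="mean")
```

`N` counts the paper-metric pairs available per metric, which is why it varies by row in Table 3.1 (some metrics lack a human score for some papers).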
Figure 3.1: LLM mid-points vs. human mid-points. The dashed 45° line marks exact agreement.
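A figure like 3.1 can be sketched with matplotlib. The data points below are placeholders, not the study's values; only the layout (scatter plus a dashed 45° reference line) follows the caption:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# Placeholder mid-points standing in for the merged human/LLM columns.
human = [60, 55, 80]
llm = [70, 50, 65]

fig, ax = plt.subplots()
ax.scatter(human, llm)
# Dashed 45° line: points on it mean the LLM and human mid-points agree exactly.
ax.plot([0, 100], [0, 100], linestyle="--", color="grey")
ax.set_xlabel("Human mid-point")
ax.set_ylabel("LLM mid-point")
```

Points above the line are papers the LLM rates higher than the human coders, and vice versa.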